Comparing many distributions

The extended movies dataset

import altair as alt
from vega_datasets import data

movies_extended = data.movies().dropna(subset=['Major_Genre'])
movies_extended
                           Title    US_Gross  Worldwide_Gross  US_DVD_Sales  ...         Director  Rotten_Tomatoes_Rating  IMDB_Rating  IMDB_Votes
1         First Love, Last Rites     10876.0          10876.0           NaN  ...             None                     NaN          6.9       207.0
2     I Married a Strange Person    203134.0         203134.0           NaN  ...             None                     NaN          6.8       865.0
3           Let's Talk About Sex    373615.0         373615.0           NaN  ...             None                    13.0          NaN         NaN
...                          ...         ...              ...           ...  ...              ...                     ...          ...         ...
3198                        Zoom  11989328.0       12506188.0     6679409.0  ...     Peter Hewitt                     3.0          3.4      7424.0
3199         The Legend of Zorro  45575336.0      141475336.0           NaN  ...  Martin Campbell                    26.0          5.7     21161.0
3200           The Mask of Zorro  93828745.0      233700000.0           NaN  ...  Martin Campbell                    82.0          6.7      4789.0

2926 rows × 16 columns

Many distributions can’t be effectively compared with histograms

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('Worldwide_Gross', bin=alt.Bin(maxbins=30)),
    alt.Y('count()'),
    alt.Color('Major_Genre'))

Many distributions can’t be effectively compared with densities either

(alt.Chart(movies_extended).mark_area().transform_density(
    'Worldwide_Gross',
    groupby=['Major_Genre'],
    as_=['Worldwide_Gross', 'density'])
 .encode(
    alt.X('Worldwide_Gross'),
    alt.Y('density:Q'),
    alt.Color('Major_Genre')))

Bar charts are effective for comparing a single value per group but hide variation

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y('Major_Genre'))

Showing a single value can lead to incorrect conclusions

Source: Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm

Figure: Barplot Hiding Points
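
To make the point concrete, here is a minimal sketch with made-up numbers (the two groups are hypothetical and not taken from the movies dataset). Both groups have the same mean, so a bar chart of means would draw two identical bars, yet their spreads differ considerably.

```python
from statistics import mean, stdev

# Hypothetical values, chosen only for illustration
group_a = [48, 49, 50, 51, 52]
group_b = [0, 25, 50, 75, 100]

mean_a, mean_b = mean(group_a), mean(group_b)
spread_a, spread_b = stdev(group_a), stdev(group_b)

print(mean_a, mean_b)      # the two means are identical
print(spread_a, spread_b)  # the standard deviations are not
```

A plot that shows only the mean cannot distinguish these two groups, which is the risk the slide title describes.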

Showing individual observations gives a richer representation than bar charts

alt.Chart(movies_extended).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'))

Tooltips are helpful for answering questions about specific observations

alt.Chart(movies_extended).mark_tick().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'),
    alt.Tooltip('Title:N'))

Heatmaps can compare multiple distributions without saturation

(alt.Chart(movies_extended).mark_rect().encode(
    alt.X('Worldwide_Gross', bin=alt.Bin(maxbins=100)),
    alt.Y('Major_Genre'),
    alt.Color('count()')))

Boxplots show several key statistics from a distribution

Jhguch at en.wikipedia via Wikimedia Commons
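
As a rough sketch of what a boxplot encodes, the snippet below computes the quartiles, the interquartile range (IQR), and the whisker ends under the common 1.5 × IQR rule. The sample values are made up, and the quartile method (`statistics.quantiles` with `method='inclusive'`) is an assumption; plotting libraries may interpolate quartiles slightly differently.

```python
from statistics import quantiles

# Hypothetical sample; 40 is an obvious outlier
values = [1, 3, 4, 5, 6, 7, 8, 9, 40]

# The box spans Q1 to Q3, with a line at the median
q1, median, q3 = quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

# Whiskers reach the most extreme observations within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = [v for v in values if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Observations beyond the fences are drawn as individual points
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```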

Boxplots can effectively compare multiple distributions

bar = alt.Chart(movies_extended).mark_bar().encode(
    alt.X('mean(Worldwide_Gross)'),
    alt.Y('Major_Genre'))

box = alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre'))

box | bar

Sorted boxplots are more effective for comparing similar distributions

genre_order = movies_extended.groupby(
    'Major_Genre')['Worldwide_Gross'].median().sort_values().index.tolist()
alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order))

Zooming in facilitates comparison of small differences

filtered_movies = movies_extended[movies_extended['Worldwide_Gross'] < 1_500_000_000]
alt.Chart(filtered_movies).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order))
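
One caveat when zooming by filtering rows: the boxplot statistics are then computed only on the rows that remain, so the boxes can shift relative to the unfiltered plot. The sketch below uses made-up numbers and a hypothetical cut-off (neither comes from the movies dataset) to show the median changing once large values are dropped.

```python
from statistics import median

# Hypothetical gross values with a few very large ones
gross = [10, 20, 30, 40, 50, 900, 1000]
threshold = 100  # hypothetical cut-off, analogous to the filter above

filtered = [g for g in gross if g < threshold]

median_all = median(gross)          # computed on every observation
median_filtered = median(filtered)  # computed after dropping large values

print(median_all, median_filtered)
```

When the filter removes only a handful of rows the shift is usually small, but it is worth checking when the cut-off discards many observations.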

Boxplots can be scaled by the number of observations

alt.Chart(movies_extended).mark_boxplot().encode(
    alt.X('Worldwide_Gross'),
    alt.Y('Major_Genre', sort=genre_order),
    alt.Size('count()'))

Boxplots are not able to accurately represent data with multiple peaks

Figure: Point Box Violin (from Autodesk research)
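
A minimal sketch of why this happens, using two made-up samples: one is spread evenly and the other has two clusters (peaks at 2 and 6), yet both share the same minimum, quartiles, and maximum, so their boxplots would look the same. The quartile method (`method='inclusive'`) is an assumption chosen to keep the arithmetic simple.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a boxplot is built from."""
    q1, q2, q3 = quantiles(values, n=4, method='inclusive')
    return (min(values), q1, q2, q3, max(values))

# Hypothetical samples, chosen only for illustration
even_spread = [0, 1, 2, 3, 4, 5, 6, 7, 8]
two_peaks = [0, 2, 2, 2, 4, 6, 6, 6, 8]

summary_even = five_number_summary(even_spread)
summary_peaks = five_number_summary(two_peaks)

print(summary_even)
print(summary_peaks)
```

Plotting the individual points, or a density-based mark such as a violin, would reveal the two clusters that the summary statistics conceal.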

Let’s apply what we learned!